The idea of this R notebook is to introduce everyone interested in data science to effective communication of data and statistical findings with suitable visualisations. The ggplot2 and plotly packages are used for this purpose since they enable producing high-quality, publication-ready visualisations for static as well as dynamic and interactive applications. Both packages are built on the so-called “Grammar of Graphics”, a scientific syntax for effective data visualisations, which describes how specific elements or layers of a plot should be separated and classified for a structured approach to visualisations. For more information, see Hadley Wickham (2010) - A Layered Grammar of Graphics and Wilkinson (2011) - The Grammar of Graphics.
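# Packages used throughout this notebook
pkgs <- c("quantmod", "tsbox", "dplyr", "stringr", "ggplot2",
          "plotly", "DT", "rtweet", "viridis")
invisible(lapply(pkgs, library, character.only = TRUE))
# Global chunk settings
# Color settings
palette(viridis(n = 10))
# palette(brewer.pal(n = 11, name = "RdYlGn"))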
Download Tesla stock data (ticker = “TSLA”) from Yahoo Finance by using the quantmod package.
getSymbols(Symbols = "TSLA",
           src = "yahoo",
           verbose = TRUE)
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## downloading TSLA .....
##
## done.
## [1] "TSLA"
Download S&P 500 index (ETF) data (ticker = “SPY”) from Yahoo Finance by using the quantmod package.
getSymbols(Symbols = "SPY",
           src = "yahoo",
           verbose = TRUE)
## downloading SPY .....
##
## done.
## [1] "SPY"
Do some data wrangling to transform Tesla stock data into a tibble with the dplyr and tsbox packages and rename its columns.
df_Tesla_stock_data <- TSLA %>%
  ts_tbl() %>%
  ts_wide() %>%
  rename(Date = time,
         Open = TSLA.Open,
         High = TSLA.High,
         Low = TSLA.Low,
         Close = TSLA.Close,
         Volume = TSLA.Volume,
         Adjusted = TSLA.Adjusted)
The Tesla stock data now looks like this, with one observation per trading day in the rows and seven different variables, also called attributes in the ML context, in the columns. For each of the 2’531 daily observations, we have the corresponding date in the Date column, the Open price at the start of trading on the exchange, the daily High and Low prices, the Close at the end of trading, the trading Volume, and finally an Adjusted price, which accounts for stock splits, dividends, and similar corporate actions.
datatable(df_Tesla_stock_data)
Do the same for the S&P 500 index data:
df_SPY_data <- SPY %>%
  ts_tbl() %>%
  ts_wide() %>%
  rename(Date = time,
         Open = SPY.Open,
         High = SPY.High,
         Low = SPY.Low,
         Close = SPY.Close,
         Volume = SPY.Volume,
         Adjusted = SPY.Adjusted)
The S&P 500 (SPY) series covers a longer period than the Tesla series and therefore has more observations, i.e. data points on 3’409 days. Otherwise, it is in the same format. Here is what it looks like:
datatable(df_SPY_data)
# Join Tesla and S&P500 data
df_Tesla_SPY <- df_SPY_data %>%
  full_join(df_Tesla_stock_data,
            by = "Date",
            suffix = c("SPY", "TSLA"))
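To see what the join above produces, here is a minimal sketch on made-up data. The toy tables and their names (df_spy_toy, df_tsla_toy, df_joined_toy) are hypothetical and only stand in for the downloaded series. It illustrates two things: columns that exist in both tables get the suffix appended directly to their name, and dates that are missing in one table are filled with NA.

```r
library(dplyr)

# Hypothetical toy data: SPY has one trading day that TSLA lacks
df_spy_toy <- tibble(
  Date     = as.Date(c("2010-06-28", "2010-06-29", "2010-06-30")),
  Adjusted = c(100, 101, 102)
)
df_tsla_toy <- tibble(
  Date     = as.Date(c("2010-06-29", "2010-06-30")),
  Adjusted = c(20, 21)
)

# Same join pattern as above: keep all dates, suffix the clashing column names
df_joined_toy <- df_spy_toy %>%
  full_join(df_tsla_toy, by = "Date", suffix = c("SPY", "TSLA"))

names(df_joined_toy)
# "Date" "AdjustedSPY" "AdjustedTSLA"

df_joined_toy$AdjustedTSLA
# NA 20 21
```

The NA values for the early dates are the reason ggplot2 later reports removed rows when the joined Tesla series is drawn.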
In addition, we now scrape tweet data from Elon Musk’s and Tesla’s official Twitter accounts with the rtweet package. Unfortunately, only the most recent 3’212 tweets per user are available, because Twitter limits free access to historical data and offers it commercially instead. Tweet scraping requires a Twitter account and a developer registration for the free Twitter API. This is fairly easy to set up, however, and should only take a couple of minutes.
df_tweets_elon_musk <- get_timeline("elonmusk", n = 5000)
df_tweets_tesla <- get_timeline("Tesla", n = 5000)
The tweets data set is fairly large, with 90 columns. Thus, only a subset of the columns is shown here to give an idea of what the data set for Elon Musk’s tweets looks like:
df_tweets_elon_musk %>%
  select(user_id, created_at, screen_name, text, source, is_quote,
         is_retweet, favorite_count, retweet_count, hashtags) %>%
  datatable(filter = "top")
…and Tesla’s official Twitter account:
df_tweets_tesla %>%
  select(user_id, created_at, screen_name, text, source, is_quote,
         is_retweet, favorite_count, retweet_count, hashtags) %>%
  datatable(filter = "top")
Now we’re ready to take the Tesla stock price data and create a basic ggplot2 time series chart. We need the above-mentioned “Grammar of Graphics” to set up each specific layer in the plot. First, we need to map the data to so-called aesthetics in the plot. Aesthetics are defined within the aes() function in ggplot2 and include plot specifications such as what goes on the x-axis and y-axis, what is shown in which colour, how the size of an object in a plot is determined, and many more. For our basic time series plot, we simply map the Date column from the stock data to the x-axis and the Adjusted stock price to the y-axis. The only additional layer needed for a finished plot is a so-called geom. Geoms determine the kind of plot we want to display and are added with the set of geom_... functions. Here, we’d like to create a simple line plot with geom_line(). We add the new layer to the plot with the + operator, save the result to a new R object, and print that object to get our first plot.
p_basic_time_series_Tesla <- ggplot(data = df_Tesla_stock_data,
                                    aes(x = Date, y = Adjusted)) + # Close
  geom_line()
p_basic_time_series_Tesla
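The layering idea can be tried without downloading anything. The following is a minimal sketch on a made-up series; df_toy, p_toy_base, and p_toy_line are hypothetical names, and the prices are invented. It separates the two steps described above: the aesthetic mapping alone only sets up the axes, and the geom added with + is what draws the data.

```r
library(ggplot2)

# Hypothetical toy series standing in for df_Tesla_stock_data
df_toy <- data.frame(
  Date     = seq(as.Date("2020-01-01"), by = "day", length.out = 30),
  Adjusted = 100 + cumsum(rep(c(1, -0.5, 2), times = 10))
)

# Step 1: data and aesthetic mappings only; printing this shows empty axes
p_toy_base <- ggplot(data = df_toy, aes(x = Date, y = Adjusted))

# Step 2: adding a geom layer with + draws one line through all 30 points
p_toy_line <- p_toy_base + geom_line()

# layer_data() returns the data ggplot2 computed for a given layer
nrow(layer_data(p_toy_line, 1))
# 30
```

Printing p_toy_line in an interactive session renders the chart, just like p_basic_time_series_Tesla above.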
For a visual overview and additional explanations of the different layers in ggplot2’s Grammar of Graphics, see this Towards Data Science article:
So far, so good. However, the plot doesn’t look particularly great, does it? The grey background is rather irritating, the date on the x-axis is only displayed every five years, it’s unclear in what units the y-axis is shown, and there’s no title or anything else to indicate what exactly is shown here. The only information we have is the evolution of the series over a time period of 10 years and its corresponding values on the y-axis. Thus, next, we adjust the scales of the x- and y-axes in a new layer, the scales layer. We reuse the plot object from above and add scale_x_... and scale_y_... functions with suitable arguments.
p_basic_time_series_Tesla_w_scales <- p_basic_time_series_Tesla +
  scale_x_date(date_breaks = "1 year",
               date_labels = "%Y") +
  scale_y_continuous(labels = scales::dollar,
                     breaks = seq(from = 0, to = 1750, by = 250))
p_basic_time_series_Tesla_w_scales
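To make the two scale arguments less of a black box, this short sketch shows the label text they produce, outside of any plot. The variable names (y_breaks, y_labels, x_breaks, x_labels) and the example dates are hypothetical. It assumes the scales package is installed, which is the case whenever ggplot2 is.

```r
# Labels that scales::dollar produces for the y-axis breaks used above
y_breaks <- seq(from = 0, to = 1750, by = 250)
y_labels <- scales::dollar(y_breaks)
y_labels
# "$0" "$250" "$500" "$750" "$1,000" "$1,250" "$1,500" "$1,750"

# date_labels = "%Y" uses strftime-style codes, the same ones base R's format() accepts
x_breaks <- seq(as.Date("2015-01-01"), as.Date("2018-01-01"), by = "1 year")
x_labels <- format(x_breaks, "%Y")
x_labels
# "2015" "2016" "2017" "2018"
```

Other format codes work the same way, for example "%Y %b" for a year followed by an abbreviated month name.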
The theme of a plot is yet another layer in the “Grammar of Graphics”. Setting a beautiful theme will help us to get rid of the irritating grey background. Let’s try the theme_classic() function.
p_basic_time_series_Tesla_w_scales_and_theme <- p_basic_time_series_Tesla_w_scales +
  theme_classic()
p_basic_time_series_Tesla_w_scales_and_theme
theme_classic() is quite a beautiful and simple theme. For the purpose of interpreting a time series plot, however, a theme including a grid is more appropriate. Thus, in the following plots we use theme_light() instead. We would also like to add a proper title. Plot titles and subtitles as well as axis labels are set with the labs() function. Let’s also adjust the y-axis label to make clearer what it represents. Finally, let’s add a caption with the copyright for the plot. Now we have our first complete time series plot.
p_basic_time_series_Tesla_w_scales +
  theme_light() +
  labs(title = "Tesla Stock Price - Rising Higher and Higher...",
       y = "Close (Adjusted)",
       caption = "© Data Science & Technology Club HSG")
For the following plots, let’s set a global default ggplot2 theme, instead of adding it manually to each plot.
theme_set(theme_light())
To improve further on our plot, we can add a so-called benchmark to it. A benchmark is, e.g., another time series to compare the Tesla stock price with. We use the previously gathered S&P 500 prices to do exactly that. In order to compare the two series and show them on the same y-axis scale, some data wrangling and rebasing is required: each series is divided by its own first price, so that both start at 1 (i.e. 100%).
df_Tesla_SPY <- df_Tesla_SPY %>%
  mutate(AdjustedTSLARebased = AdjustedTSLA / first(df_Tesla_stock_data$Adjusted),
         AdjustedSPYRebased = AdjustedSPY / first(df_SPY_data$Adjusted))
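The rebasing step can be checked by hand on a few made-up prices. In this sketch, df_toy and its values are hypothetical. It also shows why the code above takes the first price from the original Tesla table: in the joined table, the Tesla column starts with NA, so dividing by its first element would turn the whole series into NA. Here the first non-missing price is used instead, which is an alternative way to reach the same result.

```r
library(dplyr)

# Hypothetical joined table: TSLA starts trading one day after the first SPY row
df_toy <- tibble(
  Date         = as.Date("2010-06-28") + 0:3,
  AdjustedSPY  = c(100, 101, 99, 110),
  AdjustedTSLA = c(NA, 20, 25, 40)
)

df_toy <- df_toy %>%
  mutate(
    # first(AdjustedTSLA) is NA here, so divide by the first non-missing price
    AdjustedTSLARebased = AdjustedTSLA / first(AdjustedTSLA[!is.na(AdjustedTSLA)]),
    AdjustedSPYRebased  = AdjustedSPY / first(AdjustedSPY)
  )

df_toy$AdjustedTSLARebased
# NA 1.00 1.25 2.00
df_toy$AdjustedSPYRebased
# 1.00 1.01 0.99 1.10
```

A rebased value of 2 means the price has doubled since the series started, which a percent axis displays as 200%.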
p_time_series_Tesla_vs_SPY <- df_Tesla_SPY %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = AdjustedTSLARebased), col = palette()[4]) +
  # Mark and label only the last observation: the layers below receive just the
  # last row, so a single point and a single label are drawn instead of one per row
  geom_point(data = function(d) tail(d, 1),
             aes(y = AdjustedTSLARebased),
             col = palette()[4],
             size = 2) +
  geom_text(data = function(d) tail(d, 1),
            aes(y = AdjustedTSLARebased),
            label = "TSLA",
            col = palette()[4],
            hjust = 1.2,
            vjust = -1) +
  geom_line(aes(y = AdjustedSPYRebased), col = palette()[1]) +
  geom_point(data = function(d) tail(d, 1),
             aes(y = AdjustedSPYRebased),
             col = palette()[1],
             size = 2) +
  geom_text(data = function(d) tail(d, 1),
            aes(y = AdjustedSPYRebased),
            label = "S&P 500",
            col = palette()[1],
            hjust = 1.2,
            vjust = -1) +
  scale_x_date(date_breaks = "1 year",
               date_labels = "%Y") +
  scale_y_continuous(labels = scales::percent,
                     breaks = seq(from = 0, to = 70, by = 10)) +
  labs(title = "Tesla Stock Price - Tesla vs. S&P 500",
       y = "Price Rebased (%)",
       caption = "© Data Science & Technology Club HSG")
p_time_series_Tesla_vs_SPY
## Warning: Removed 878 row(s) containing missing values (geom_path).
Create bar plot of TSLA stock volume
p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
  ggplot(aes(x = Date, y = Volume)) +
  geom_col() +
  labs(title = "Tesla Trading Volume - ...While Trading Volume Remained Constant Over Time",
       caption = "© Data Science & Technology Club HSG") +
  scale_x_date(date_breaks = "1 year",
               date_labels = "%Y") +
  # Volume is a number of shares, not a dollar amount, so use comma labels
  scale_y_continuous(labels = scales::comma,
                     breaks = seq(from = 0, to = 70e6, by = 10e6))
p_bar_Tesla_stock_volume
We can play with the width argument in geom_col() to adjust the width of the bars plotted.
p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
  ggplot(aes(x = Date, y = Volume)) +
  geom_col(width = 0.2) +
  labs(title = "Tesla Trading Volume - ...While Trading Volume Remained Constant Over Time",
       caption = "© Data Science & Technology Club HSG") +
  scale_x_date(date_breaks = "1 year",
               date_labels = "%Y") +
  # Volume is a number of shares, not a dollar amount, so use comma labels
  scale_y_continuous(labels = scales::comma,
                     breaks = seq(from = 0, to = 70e6, by = 10e6))
p_bar_Tesla_stock_volume
Compute stock returns
df_Tesla_stock_data <- df_Tesla_stock_data %>%
  mutate(Return = log(Adjusted) - log(lag(Adjusted)))
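The return computed above is the continuous (log) return, i.e. the difference between today's and yesterday's log price. A small sketch with three made-up prices shows the calculation and one useful property: log returns add up over time. The names prices_toy and returns_toy are hypothetical.

```r
library(dplyr)

# Hypothetical prices on three consecutive trading days
prices_toy <- c(100, 110, 99)

# Same formula as above: log price minus the previous day's log price.
# dplyr::lag() shifts the vector by one position, so the first return is NA.
returns_toy <- log(prices_toy) - log(dplyr::lag(prices_toy))
returns_toy
# NA, then log(110 / 100) and log(99 / 110)

# Log returns are additive: their sum equals the log return over the whole period
total_log_return <- sum(returns_toy, na.rm = TRUE)
isTRUE(all.equal(total_log_return, log(99 / 100)))
# TRUE
```

The NA in the first position is also why the histogram of the returns reports one removed row.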
df_Tesla_stock_data %>%
  ggplot(aes(x = Return)) +
  geom_histogram(bins = 500,
                 col = palette()[8]) +
  geom_density(col = palette()[1]) +
  labs(title = "Histogram - Tesla Stock Returns",
       x = "Continuous Returns",
       y = "Count")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).
Previous time series chart with plotly
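# Place the caption in "paper" coordinates (0 to 1 across the plotting area);
# without xref/yref, x = 1 and y = 1 are read as positions on the data axes
p_time_series_Tesla_vs_SPY <- p_time_series_Tesla_vs_SPY %>%
  ggplotly() %>%
  layout(annotations = list(x = 1, y = 1,
                            xref = "paper", yref = "paper",
                            xanchor = "right", showarrow = FALSE,
                            text = "© Data Science & Technology Club HSG"))
p_time_series_Tesla_vs_SPY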
Compute number of Elon Musk’s tweets per day and show summary statistics
df_tweets_elon_musk_per_day <- df_tweets_elon_musk %>%
  mutate(Date = as.Date(created_at)) %>%
  group_by(Date) %>%
  summarise(TweetsN = n())
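The per-day aggregation can be illustrated with three made-up timestamps. In this sketch, df_toy_tweets and df_toy_per_day are hypothetical, and the timestamps are assumed to be in UTC. It uses count(), a dplyr shorthand for the group_by() plus summarise(n()) pattern above.

```r
library(dplyr)

# Hypothetical tweet timestamps: two on 1 May 2020, one on 2 May 2020
df_toy_tweets <- tibble(
  created_at = as.POSIXct(c("2020-05-01 08:00:00",
                            "2020-05-01 21:30:00",
                            "2020-05-02 10:15:00"), tz = "UTC")
)

# as.Date() drops the time of day, so all tweets of one day share one Date value
df_toy_per_day <- df_toy_tweets %>%
  mutate(Date = as.Date(created_at)) %>%
  count(Date, name = "TweetsN")

df_toy_per_day$TweetsN
# 2 1
```

Days without any tweet do not appear in the result at all, so summary statistics such as the minimum describe only days with at least one tweet.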
# TODO: Add a caption to table
df_tweets_elon_musk_per_day %>%
  summarise(Min = min(TweetsN, na.rm = TRUE),
            FirstQuartile = quantile(TweetsN, probs = 0.25),
            Median = median(TweetsN, na.rm = TRUE),
            ThirdQuartile = quantile(TweetsN, probs = 0.75),
            Max = max(TweetsN, na.rm = TRUE)) %>%
  datatable()
# Create a bar chart
# TODO: Set own color scale
p_bar_tweets_elon_musk <- df_tweets_elon_musk_per_day %>%
  ggplot(aes(x = Date, y = TweetsN, fill = TweetsN)) +
  geom_col() +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%Y %b") +
  scale_y_continuous(breaks = seq(0, 60, 10)) +
  labs(title = "Tweets by Elon Musk",
       x = "Month",
       y = "Number of Tweets") +
  scale_fill_binned(type = "viridis") +
  theme(axis.text.x = element_text(angle = 60,
                                   hjust = 1))
p_bar_tweets_elon_musk <- p_bar_tweets_elon_musk %>%
  ggplotly()
subplot(p_time_series_Tesla_vs_SPY,
        p_bar_tweets_elon_musk,
        nrows = 2,
        shareX = TRUE)
# Get Tesla tweets
df_tweets_elon_musk_Tesla <- df_tweets_elon_musk %>%
  filter(str_detect(text, pattern = "Tesla"))
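One detail of the filter above is that str_detect() matches case-sensitively by default, so a tweet that spells the name in lower case is not selected. The sketch below uses three made-up texts (texts_toy and the two result vectors are hypothetical) and shows a case-insensitive variant with stringr's regex() helper, in case that behaviour is preferred.

```r
library(stringr)

# Hypothetical tweet texts
texts_toy <- c("Tesla Model 3 deliveries", "tesla is hiring", "Starship update")

# Pattern as used above: matches "Tesla" with a capital T only
match_exact <- str_detect(texts_toy, pattern = "Tesla")
match_exact
# TRUE FALSE FALSE

# Case-insensitive alternative
match_any_case <- str_detect(texts_toy, pattern = regex("tesla", ignore_case = TRUE))
match_any_case
# TRUE TRUE FALSE
```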
# TODO
# wordcloud2()
# OHLC
# TODO: Add nicer colors
p_LC <- df_Tesla_stock_data %>%
  ggplot(aes(x = Date, y = Adjusted)) +
  geom_line(size = 1) +
  geom_line(aes(y = Low),
            col = palette()[1],
            linetype = "dashed") +
  geom_line(aes(y = High),
            col = palette()[8],
            linetype = "dashed") +
  geom_ribbon(aes(ymin = Low,
                  ymax = High),
              alpha = 0.4) +
  labs(title = "Tesla Stock Price - Adjusted Close with Daily Low-High Range") +
  scale_x_date(date_breaks = "1 year",
               date_labels = "%Y") +
  scale_y_continuous(labels = scales::dollar)
p_LC %>%
  ggplotly()